Introduction to Spark

Basic initialization

SparkSession is used to connect to the Spark Cluster.

In [191]:
from pyspark.sql import SparkSession

We will use Pandas to operate on the reduced data in the driver program.

In [192]:
import pandas as pd

Numpy will be always useful.

In [193]:
import numpy as np

Create a new session (or reuse an existing one).

In [194]:
spark = SparkSession.builder.getOrCreate()
In [195]:
spark
Out[195]:

SparkSession - in-memory

SparkContext

Spark UI

Version
v3.0.0
Master
local[*]
AppName
pyspark-shell

We can see that the session is established.

Creating Spark Data Frames from Pandas

We can list the tables in our Spark Session, currently empty.

In [196]:
print(spark.catalog.listTables())
[Table(name='airports', database=None, description=None, tableType='TEMPORARY', isTemporary=True), Table(name='countries', database=None, description=None, tableType='TEMPORARY', isTemporary=True), Table(name='temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

We can create a Pandas DataFrame with random values.

In [197]:
pd_temp = pd.DataFrame(np.random.random(100))

We can see on the plot that it is really random:

In [198]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
In [199]:
pd_temp.plot()
Out[199]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a49f47790>

Now we can convert it into Spark DataFrame:

In [200]:
spark_temp = spark.createDataFrame(pd_temp)

createOrReplaceTempView creates (or replaces if that view name already exists) a lazily evaluated "view" that you can then use like a table in Spark SQL.

It does not persist to memory unless you cache (persist) the dataset that underpins the view.

In [201]:
spark_temp.createOrReplaceTempView("temp")

The created view is TEMPORARY which means it is not persistent.

In [202]:
print(spark.catalog.listTables())
[Table(name='airports', database=None, description=None, tableType='TEMPORARY', isTemporary=True), Table(name='countries', database=None, description=None, tableType='TEMPORARY', isTemporary=True), Table(name='temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
In [203]:
spark_temp.show()
+-------------------+
|                  0|
+-------------------+
| 0.5598159622791349|
|  0.692581151690837|
| 0.5566890799700385|
|0.21320317348979967|
| 0.9533064740563738|
| 0.9515109966664301|
|0.08971302159506889|
|  0.918730490544433|
| 0.9863597143334963|
| 0.9309464183408591|
|0.45982479145371546|
| 0.8412565089823657|
|0.22967249824400982|
| 0.5745940979467912|
|0.43512869450388314|
| 0.7313710327794464|
| 0.8748768406110943|
| 0.8378203990704497|
|0.17154499120646216|
| 0.5025281363634896|
+-------------------+
only showing top 20 rows

We can now use transformations on this DataFrame. The transformations are translated (compiled) to RDD transformations.

In [204]:
from pyspark.sql.functions import col, asc
In [205]:
spark_temp.filter((col('0') > 0.9)).show()
+------------------+
|                 0|
+------------------+
|0.9533064740563738|
|0.9515109966664301|
| 0.918730490544433|
|0.9863597143334963|
|0.9309464183408591|
|0.9776064556660955|
|0.9027173675223089|
|0.9447483276871228|
|0.9388582279932038|
|0.9305454585898081|
| 0.905766797206726|
|0.9745006987100845|
|0.9525529214372624|
|0.9615659384986127|
+------------------+

Creating Spark Data Frames from input files

In [206]:
file_path = "airports.csv"

# Read in the airports data
airports = spark.read.csv(file_path,header=True)

# Show the data
print(airports.show())
+---+--------------------+------------------+----------------+---+----+------------------+-------------------+----+---+---+--------------------+-------+-----------+
|  1|      Goroka Airport|            Goroka|Papua New Guinea|GKA|AYGA|-6.081689834590001|      145.391998291|5282| 10|  U|Pacific/Port_Moresby|airport|OurAirports|
+---+--------------------+------------------+----------------+---+----+------------------+-------------------+----+---+---+--------------------+-------+-----------+
|  2|      Madang Airport|            Madang|Papua New Guinea|MAG|AYMD|    -5.20707988739|      145.789001465|  20| 10|  U|Pacific/Port_Moresby|airport|OurAirports|
|  3|Mount Hagen Kagam...|       Mount Hagen|Papua New Guinea|HGU|AYMH|-5.826789855957031| 144.29600524902344|5388| 10|  U|Pacific/Port_Moresby|airport|OurAirports|
|  4|      Nadzab Airport|            Nadzab|Papua New Guinea|LAE|AYNZ|         -6.569803|         146.725977| 239| 10|  U|Pacific/Port_Moresby|airport|OurAirports|
|  5|Port Moresby Jack...|      Port Moresby|Papua New Guinea|POM|AYPY|-9.443380355834961| 147.22000122070312| 146| 10|  U|Pacific/Port_Moresby|airport|OurAirports|
|  6|Wewak Internation...|             Wewak|Papua New Guinea|WWK|AYWK|    -3.58383011818|      143.669006348|  19| 10|  U|Pacific/Port_Moresby|airport|OurAirports|
|  7|  Narsarsuaq Airport|      Narssarssuaq|       Greenland|UAK|BGBW|     61.1604995728|     -45.4259986877| 112| -3|  E|     America/Godthab|airport|OurAirports|
|  8|Godthaab / Nuuk A...|          Godthaab|       Greenland|GOH|BGGH|       64.19090271|     -51.6781005859| 283| -3|  E|     America/Godthab|airport|OurAirports|
|  9|Kangerlussuaq Air...|       Sondrestrom|       Greenland|SFJ|BGSF|     67.0122218992|     -50.7116031647| 165| -3|  E|     America/Godthab|airport|OurAirports|
| 10|      Thule Air Base|             Thule|       Greenland|THU|BGTL|     76.5311965942|     -68.7032012939| 251| -4|  E|       America/Thule|airport|OurAirports|
| 11|    Akureyri Airport|          Akureyri|         Iceland|AEY|BIAR| 65.66000366210938| -18.07270050048828|   6|  0|  N|  Atlantic/Reykjavik|airport|OurAirports|
| 12| Egilsstaðir Airport|       Egilsstadir|         Iceland|EGS|BIEG|  65.2833023071289|-14.401399612426758|  76|  0|  N|  Atlantic/Reykjavik|airport|OurAirports|
| 13|Hornafjörður Airport|              Hofn|         Iceland|HFN|BIHN|         64.295601|           -15.2272|  24|  0|  N|  Atlantic/Reykjavik|airport|OurAirports|
| 14|     Húsavík Airport|           Husavik|         Iceland|HZK|BIHU|         65.952301|         -17.426001|  48|  0|  N|  Atlantic/Reykjavik|airport|OurAirports|
| 15|  Ísafjörður Airport|        Isafjordur|         Iceland|IFJ|BIIS| 66.05809783935547|-23.135299682617188|   8|  0|  N|  Atlantic/Reykjavik|airport|OurAirports|
| 16|Keflavik Internat...|          Keflavik|         Iceland|KEF|BIKF|   63.985000610352|   -22.605600357056| 171|  0|  N|  Atlantic/Reykjavik|airport|OurAirports|
| 17|Patreksfjörður Ai...|    Patreksfjordur|         Iceland|PFJ|BIPA|         65.555801|            -23.965|  11|  0|  N|  Atlantic/Reykjavik|airport|OurAirports|
| 18|   Reykjavik Airport|         Reykjavik|         Iceland|RKV|BIRK|     64.1299972534|     -21.9405994415|  48|  0|  N|  Atlantic/Reykjavik|airport|OurAirports|
| 19|Siglufjörður Airport|      Siglufjordur|         Iceland|SIJ|BISI|         66.133301|           -18.9167|  10|  0|  N|  Atlantic/Reykjavik|airport|OurAirports|
| 20|Vestmannaeyjar Ai...|    Vestmannaeyjar|         Iceland|VEY|BIVM| 63.42430114746094|-20.278900146484375| 326|  0|  N|  Atlantic/Reykjavik|airport|OurAirports|
| 21|Sault Ste Marie A...|Sault Sainte Marie|          Canada|YAM|CYAM| 46.48500061035156|  -84.5093994140625| 630| -5|  A|     America/Toronto|airport|OurAirports|
+---+--------------------+------------------+----------------+---+----+------------------+-------------------+----+---+---+--------------------+-------+-----------+
only showing top 20 rows

None

It may be useful to convert them to Pandas for quick browsing.

Warning! This is not efficient for large datasets, as it requires performing actions on the dataset.

In [207]:
airports.toPandas()
Out[207]:
1 Goroka Airport Goroka Papua New Guinea GKA AYGA -6.081689834590001 145.391998291 5282 10 U Pacific/Port_Moresby airport OurAirports
0 2 Madang Airport Madang Papua New Guinea MAG AYMD -5.20707988739 145.789001465 20 10 U Pacific/Port_Moresby airport OurAirports
1 3 Mount Hagen Kagamuga Airport Mount Hagen Papua New Guinea HGU AYMH -5.826789855957031 144.29600524902344 5388 10 U Pacific/Port_Moresby airport OurAirports
2 4 Nadzab Airport Nadzab Papua New Guinea LAE AYNZ -6.569803 146.725977 239 10 U Pacific/Port_Moresby airport OurAirports
3 5 Port Moresby Jacksons International Airport Port Moresby Papua New Guinea POM AYPY -9.443380355834961 147.22000122070312 146 10 U Pacific/Port_Moresby airport OurAirports
4 6 Wewak International Airport Wewak Papua New Guinea WWK AYWK -3.58383011818 143.669006348 19 10 U Pacific/Port_Moresby airport OurAirports
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7178 12053 Rugao Air Base Rugao China RUG ZSRG 32.25788497924805 120.50165557861328 0 \N \N \N airport OurAirports
7179 12054 Wuhu Air Base Wuhu China WHU ZSWU 31.3906 118.408997 0 \N \N \N airport OurAirports
7180 12055 Shanshan Airport Shanshan China SXJ ZWSS 42.91170120239258 90.24749755859375 0 \N \N \N airport OurAirports
7181 12056 Yingkou Lanqi Airport Yingkou China YKH ZYYK 40.542524 122.3586 0 \N \N \N airport OurAirports
7182 12057 Shenyang Dongta Airport Shenyang China \N ZYYY 41.784400939941406 123.49600219726562 0 \N \N \N airport OurAirports

7183 rows × 14 columns

Running SQL queries on dataframes

In [208]:
airports.createOrReplaceTempView("airports")
In [209]:
# Get the first 10 rows of flights
query = "FROM airports SELECT * LIMIT 10"

airports10 = spark.sql(query)

# Show the results
airports10.show()
+---+--------------------+------------+----------------+---+----+------------------+------------------+----+---+---+--------------------+-------+-----------+
|  1|      Goroka Airport|      Goroka|Papua New Guinea|GKA|AYGA|-6.081689834590001|     145.391998291|5282| 10|  U|Pacific/Port_Moresby|airport|OurAirports|
+---+--------------------+------------+----------------+---+----+------------------+------------------+----+---+---+--------------------+-------+-----------+
|  2|      Madang Airport|      Madang|Papua New Guinea|MAG|AYMD|    -5.20707988739|     145.789001465|  20| 10|  U|Pacific/Port_Moresby|airport|OurAirports|
|  3|Mount Hagen Kagam...| Mount Hagen|Papua New Guinea|HGU|AYMH|-5.826789855957031|144.29600524902344|5388| 10|  U|Pacific/Port_Moresby|airport|OurAirports|
|  4|      Nadzab Airport|      Nadzab|Papua New Guinea|LAE|AYNZ|         -6.569803|        146.725977| 239| 10|  U|Pacific/Port_Moresby|airport|OurAirports|
|  5|Port Moresby Jack...|Port Moresby|Papua New Guinea|POM|AYPY|-9.443380355834961|147.22000122070312| 146| 10|  U|Pacific/Port_Moresby|airport|OurAirports|
|  6|Wewak Internation...|       Wewak|Papua New Guinea|WWK|AYWK|    -3.58383011818|     143.669006348|  19| 10|  U|Pacific/Port_Moresby|airport|OurAirports|
|  7|  Narsarsuaq Airport|Narssarssuaq|       Greenland|UAK|BGBW|     61.1604995728|    -45.4259986877| 112| -3|  E|     America/Godthab|airport|OurAirports|
|  8|Godthaab / Nuuk A...|    Godthaab|       Greenland|GOH|BGGH|       64.19090271|    -51.6781005859| 283| -3|  E|     America/Godthab|airport|OurAirports|
|  9|Kangerlussuaq Air...| Sondrestrom|       Greenland|SFJ|BGSF|     67.0122218992|    -50.7116031647| 165| -3|  E|     America/Godthab|airport|OurAirports|
| 10|      Thule Air Base|       Thule|       Greenland|THU|BGTL|     76.5311965942|    -68.7032012939| 251| -4|  E|       America/Thule|airport|OurAirports|
| 11|    Akureyri Airport|    Akureyri|         Iceland|AEY|BIAR| 65.66000366210938|-18.07270050048828|   6|  0|  N|  Atlantic/Reykjavik|airport|OurAirports|
+---+--------------------+------------+----------------+---+----+------------------+------------------+----+---+---+--------------------+-------+-----------+

More complex examples

Read data from CSV file:

  • inferSchema - to detect which columns are numbers (not strigs!) - useful e.g. for sorting.
  • header - to read the firs line as column names
In [210]:
countries = spark.read.csv("countries of the world.csv",inferSchema=True,header=True)
In [211]:
countries.toPandas()
Out[211]:
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
0 Afghanistan ASIA (EX. NEAR EAST) 31056997 647500 48,0 0,00 23,06 163,07 700.0 36,0 3,2 12,13 0,22 87,65 1 46,6 20,34 0,38 0,24 0,38
1 Albania EASTERN EUROPE 3581655 28748 124,6 1,26 -4,93 21,52 4500.0 86,5 71,2 21,09 4,42 74,49 3 15,11 5,22 0,232 0,188 0,579
2 Algeria NORTHERN AFRICA 32930091 2381740 13,8 0,04 -0,39 31 6000.0 70,0 78,1 3,22 0,25 96,53 1 17,14 4,61 0,101 0,6 0,298
3 American Samoa OCEANIA 57794 199 290,4 58,29 -20,71 9,27 8000.0 97,0 259,5 10 15 75 2 22,46 3,27 None None None
4 Andorra WESTERN EUROPE 71201 468 152,1 0,00 6,6 4,05 19000.0 100,0 497,2 2,22 0 97,78 3 8,71 6,25 None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
222 West Bank NEAR EAST 2460492 5860 419,9 0,00 2,98 19,62 800.0 None 145,2 16,9 18,97 64,13 3 31,67 3,92 0,09 0,28 0,63
223 Western Sahara NORTHERN AFRICA 273008 266000 1,0 0,42 None None NaN None None 0,02 0 99,98 1 None None None None 0,4
224 Yemen NEAR EAST 21456188 527970 40,6 0,36 0 61,5 800.0 50,2 37,2 2,78 0,24 96,98 1 42,89 8,3 0,135 0,472 0,393
225 Zambia SUB-SAHARAN AFRICA 11502010 752614 15,3 0,00 0 88,29 800.0 80,6 8,2 7,08 0,03 92,9 2 41 19,93 0,22 0,29 0,489
226 Zimbabwe SUB-SAHARAN AFRICA 12236805 390580 31,3 0,00 0 67,69 1900.0 90,7 26,8 8,32 0,34 91,34 2 28,01 21,84 0,179 0,243 0,579

227 rows × 20 columns

We can inspect the schema of the DataFrame.

In [212]:
countries.printSchema()
root
 |-- Country: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Population: integer (nullable = true)
 |-- Area (sq. mi.): integer (nullable = true)
 |-- Pop. Density (per sq. mi.): string (nullable = true)
 |-- Coastline (coast/area ratio): string (nullable = true)
 |-- Net migration: string (nullable = true)
 |-- Infant mortality (per 1000 births): string (nullable = true)
 |-- GDP ($ per capita): integer (nullable = true)
 |-- Literacy (%): string (nullable = true)
 |-- Phones (per 1000): string (nullable = true)
 |-- Arable (%): string (nullable = true)
 |-- Crops (%): string (nullable = true)
 |-- Other (%): string (nullable = true)
 |-- Climate: string (nullable = true)
 |-- Birthrate: string (nullable = true)
 |-- Deathrate: string (nullable = true)
 |-- Agriculture: string (nullable = true)
 |-- Industry: string (nullable = true)
 |-- Service: string (nullable = true)

Examples of SQL Queries

In [213]:
countries.createOrReplaceTempView("countries")
In [214]:
spark.sql("SELECT * FROM countries WHERE Region LIKE '%OCEANIA%'").toPandas()
Out[214]:
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
0 American Samoa OCEANIA 57794 199 290,4 58,29 -20,71 9,27 8000 97,0 259,5 10 15 75 2 22,46 3,27 None None None
1 Australia OCEANIA 20264082 7686850 2,6 0,34 3,98 4,69 29000 100,0 565,5 6,55 0,04 93,41 1 12,14 7,51 0,038 0,262 0,7
2 Cook Islands OCEANIA 21388 240 89,1 50,00 None None 5000 95,0 289,9 17,39 13,04 69,57 2 21 None 0,151 0,096 0,753
3 Fiji OCEANIA 905949 18270 49,6 6,18 -3,14 12,62 5800 93,7 112,6 10,95 4,65 84,4 2 22,55 5,65 0,089 0,135 0,776
4 French Polynesia OCEANIA 274578 4167 65,9 60,60 2,94 8,44 17500 98,0 194,5 0,82 5,46 93,72 2 16,68 4,69 0,031 0,19 0,769
5 Guam OCEANIA 171019 541 316,1 23,20 0 6,94 21000 99,0 492,0 9,09 16,36 74,55 2 18,79 4,48 None None None
6 Kiribati OCEANIA 105432 811 130,0 140,94 0 48,52 800 None 42,7 2,74 50,68 46,58 2 30,65 8,26 0,089 0,242 0,668
7 Marshall Islands OCEANIA 60422 11854 5,1 3,12 -6,04 29,45 1600 93,7 91,2 16,67 38,89 44,44 2 33,05 4,78 0,317 0,149 0,534
8 Micronesia, Fed. St. OCEANIA 108004 702 153,9 870,66 -20,99 30,21 2000 89,0 114,8 5,71 45,71 48,58 2 24,68 4,75 0,289 0,152 0,559
9 Nauru OCEANIA 13287 21 632,7 142,86 0 9,95 5000 None 143,0 0 0 100 2 24,76 6,7 None None None
10 New Caledonia OCEANIA 219246 19060 11,5 11,83 0 7,72 15000 91,0 252,2 0,38 0,33 99,29 2 18,11 5,69 0,15 0,088 0,762
11 New Zealand OCEANIA 4076140 268680 15,2 5,63 4,05 5,85 21600 99,0 441,7 5,6 6,99 87,41 3 13,76 7,53 0,043 0,273 0,684
12 N. Mariana Islands OCEANIA 82459 477 172,9 310,69 9,61 7,11 12500 97,0 254,7 13,04 4,35 82,61 2 19,43 2,29 None None None
13 Palau OCEANIA 20579 458 44,9 331,66 2,85 14,84 9000 92,0 325,6 8,7 4,35 86,95 2 18,03 6,8 0,062 0,12 0,818
14 Papua New Guinea OCEANIA 5670544 462840 12,3 1,11 0 51,45 2200 64,6 10,9 0,46 1,44 98,1 2 29,36 7,25 0,353 0,381 0,266
15 Samoa OCEANIA 176908 2944 60,1 13,69 -11,7 27,71 5600 99,7 75,2 21,2 24,38 54,42 2 16,43 6,62 0,114 0,584 0,302
16 Solomon Islands OCEANIA 552438 28450 19,4 18,67 0 21,29 1700 None 13,4 0,64 2 97,36 2 30,01 3,92 0,42 0,11 0,47
17 Tonga OCEANIA 114689 748 153,3 56,02 0 12,62 2200 98,5 97,7 23,61 43,06 33,33 2 25,37 5,28 0,23 0,27 0,5
18 Tuvalu OCEANIA 11810 26 454,2 92,31 0 20,03 1100 None 59,3 0 0 100 2 22,18 7,11 0,166 0,272 0,562
19 Vanuatu OCEANIA 208869 12200 17,1 20,72 0 55,16 2900 53,0 32,6 2,46 7,38 90,16 2 22,72 7,82 0,26 0,12 0,62
20 Wallis and Futuna OCEANIA 16025 274 58,5 47,08 None None 3700 50,0 118,6 5 25 70 2 None None None None None

Queries using PySpark DSL

DSL = Domain Specific Language - API similar to natural or other language, implemented as library in another language.

List all the countries with the population > 38 million

In [215]:
countries.filter((col("Population") > 38000000)).orderBy("Population").toPandas()
Out[215]:
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
0 Poland EASTERN EUROPE 38536869 312685 123,3 0,16 -0,49 8,51 11100 99,8 306,3 45,91 1,12 52,97 3 9,85 9,89 0,05 0,311 0,64
1 Argentina LATIN AMER. & CARIB 39921833 2766890 14,4 0,18 0,61 15,18 11200 97,1 220,4 12,31 0,48 87,21 3 16,73 7,55 0,095 0,358 0,547
2 Spain WESTERN EUROPE 40397842 504782 80,0 0,98 0,99 4,42 22000 97,9 453,5 26,07 9,87 64,06 3 10,06 9,72 0,04 0,295 0,665
3 Sudan SUB-SAHARAN AFRICA 41236378 2505810 16,5 0,03 -0,02 62,5 1900 61,1 16,3 6,83 0,18 92,99 2 34,53 8,97 0,387 0,203 0,41
4 Colombia LATIN AMER. & CARIB 43593035 1138910 38,3 0,28 -0,31 20,97 6300 92,5 176,2 2,42 1,67 95,91 2 20,48 5,58 0,125 0,342 0,533
5 South Africa SUB-SAHARAN AFRICA 44187637 1219912 36,2 0,23 -0,29 61,81 10700 86,4 107,0 12,08 0,79 87,13 1 18,2 22 0,025 0,303 0,671
6 Ukraine C.W. OF IND. STATES 46710816 603700 77,4 0,46 -0,39 20,34 5400 99,7 259,9 56,21 1,61 42,18 3 8,82 14,39 0,187 0,452 0,361
7 Burma ASIA (EX. NEAR EAST) 47382633 678500 69,8 0,28 -1,8 67,24 1800 85,3 10,1 15,19 0,97 83,84 2 17,91 9,83 0,564 0,082 0,353
8 Korea, South ASIA (EX. NEAR EAST) 48846823 98480 496,0 2,45 0 7,05 17800 97,9 486,1 17,18 1,95 80,87 3 10 5,85 0,033 0,403 0,563
9 Italy WESTERN EUROPE 58133509 301230 193,0 2,52 2,07 5,94 26700 98,6 430,9 27,79 9,53 62,68 None 8,72 10,4 0,021 0,291 0,688
10 United Kingdom WESTERN EUROPE 60609153 244820 247,6 5,08 2,19 5,16 27700 99,0 543,5 23,46 0,21 76,33 3 10,71 10,13 0,005 0,237 0,758
11 France WESTERN EUROPE 60876136 547030 111,3 0,63 0,66 4,26 27600 99,0 586,4 33,53 2,07 64,4 4 11,99 9,14 0,022 0,214 0,764
12 Congo, Dem. Rep. SUB-SAHARAN AFRICA 62660551 2345410 26,7 0,00 0 94,69 700 65,5 0,2 2,96 0,52 96,52 2 43,69 13,27 0,55 0,11 0,34
13 Thailand ASIA (EX. NEAR EAST) 64631595 514000 125,7 0,63 0 20,48 7400 92,6 108,9 29,36 6,46 64,18 2 13,87 7,04 0,099 0,441 0,46
14 Iran ASIA (EX. NEAR EAST) 68688433 1648000 41,7 0,15 -0,84 41,58 7000 79,4 276,4 8,72 1,39 89,89 1 17 5,55 0,116 0,424 0,46
15 Turkey NEAR EAST 70413958 780580 90,2 0,92 0 41,04 6700 86,5 269,5 30,93 3,31 65,76 3 16,62 5,97 0,117 0,298 0,585
16 Ethiopia SUB-SAHARAN AFRICA 74777981 1127127 66,3 0,00 0 95,32 700 42,7 8,2 10,71 0,75 88,54 2 37,98 14,86 0,475 0,099 0,426
17 Egypt NORTHERN AFRICA 78887007 1001450 78,8 0,24 -0,22 32,59 4000 57,7 131,8 2,87 0,48 96,65 1 22,94 5,23 0,149 0,357 0,493
18 Germany WESTERN EUROPE 82422299 357021 230,9 0,67 2,18 4,16 27600 99,0 667,9 33,85 0,59 65,56 3 8,25 10,62 0,009 0,296 0,695
19 Vietnam ASIA (EX. NEAR EAST) 84402966 329560 256,1 1,05 -0,45 25,95 2500 90,3 187,7 19,97 5,95 74,08 2 16,86 6,22 0,209 0,41 0,381
20 Philippines ASIA (EX. NEAR EAST) 89468677 300000 298,2 12,10 -1,5 23,51 4600 92,6 38,4 18,95 16,77 64,28 2 24,89 5,41 0,144 0,326 0,53
21 Mexico LATIN AMER. & CARIB 107449525 1972550 54,5 0,47 -4,87 20,91 9000 92,2 181,6 12,99 1,31 85,7 1,5 20,69 4,74 0,038 0,259 0,702
22 Japan ASIA (EX. NEAR EAST) 127463611 377835 337,4 7,87 0 3,26 28200 99,0 461,2 12,19 0,96 86,85 3 9,37 9,16 0,017 0,258 0,725
23 Nigeria SUB-SAHARAN AFRICA 131859731 923768 142,7 0,09 0,26 98,8 900 68,0 9,3 31,29 2,96 65,75 1,5 40,43 16,94 0,269 0,487 0,244
24 Russia C.W. OF IND. STATES 142893540 17075200 8,4 0,22 1,02 15,39 8900 99,6 280,6 7,33 0,11 92,56 None 9,95 14,65 0,054 0,371 0,575
25 Bangladesh ASIA (EX. NEAR EAST) 147365352 144000 1023,4 0,40 -0,71 62,6 1900 43,1 7,3 62,11 3,07 34,82 2 29,8 8,27 0,199 0,198 0,603
26 Pakistan ASIA (EX. NEAR EAST) 165803560 803940 206,2 0,13 -2,77 72,44 2100 45,7 31,8 27,87 0,87 71,26 1 29,74 8,23 0,216 0,251 0,533
27 Brazil LATIN AMER. & CARIB 188078227 8511965 22,1 0,09 -0,03 29,61 7600 86,4 225,3 6,96 0,9 92,15 2 16,56 6,17 0,084 0,4 0,516
28 Indonesia ASIA (EX. NEAR EAST) 245452739 1919440 127,9 2,85 0 35,6 3200 87,9 52,0 11,32 7,23 81,45 2 20,34 6,25 0,134 0,458 0,408
29 United States NORTHERN AMERICA 298444215 9631420 31,0 0,21 3,41 6,5 37800 97,0 898,0 19,13 0,22 80,65 3 14,14 8,26 0,01 0,204 0,787
30 India ASIA (EX. NEAR EAST) 1095351995 3287590 333,2 0,21 -0,07 56,29 2900 59,5 45,4 54,4 2,74 42,86 2,5 22,01 8,18 0,186 0,276 0,538
31 China ASIA (EX. NEAR EAST) 1313973713 9596960 136,9 0,15 -0,4 24,18 5000 90,9 266,7 15,4 1,25 83,35 1,5 13,25 6,97 0,125 0,473 0,403

Select all the countries from Europe

In [216]:
countries.select("Country", "Population").where(col("Region").like("%EUROPE%")).show()
+--------------------+----------+
|             Country|Population|
+--------------------+----------+
|            Albania |   3581655|
|            Andorra |     71201|
|            Austria |   8192880|
|            Belgium |  10379067|
|Bosnia & Herzegov...|   4498976|
|           Bulgaria |   7385367|
|            Croatia |   4494749|
|     Czech Republic |  10235455|
|            Denmark |   5450661|
|      Faroe Islands |     47246|
|            Finland |   5231372|
|             France |  60876136|
|            Germany |  82422299|
|          Gibraltar |     27928|
|             Greece |  10688058|
|           Guernsey |     65409|
|            Hungary |   9981334|
|            Iceland |    299388|
|            Ireland |   4062235|
|        Isle of Man |     75441|
+--------------------+----------+
only showing top 20 rows

Conditions in where clause can contain logical expressions.

In [217]:
countries.select("Country", "Population")\
.where((col("Region").like("%EUROPE%")) & (col("Population")> 10000000)).show()
+---------------+----------+
|        Country|Population|
+---------------+----------+
|       Belgium |  10379067|
|Czech Republic |  10235455|
|        France |  60876136|
|       Germany |  82422299|
|        Greece |  10688058|
|         Italy |  58133509|
|   Netherlands |  16491461|
|        Poland |  38536869|
|      Portugal |  10605870|
|       Romania |  22303552|
|         Spain |  40397842|
|United Kingdom |  60609153|
+---------------+----------+

Aggregation

We can run aggregations with predefined functions (faster!):

In [218]:
from pyspark.sql.functions import sum
In [219]:
pd_countries = countries.select("Region", "Population").groupBy("Region").agg(sum("Population")).toPandas()
In [220]:
pd_countries
Out[220]:
Region sum(Population)
0 BALTICS 7184974
1 C.W. OF IND. STATES 280081548
2 ASIA (EX. NEAR EAST) 3687982236
3 WESTERN EUROPE 396339998
4 NORTHERN AMERICA 331672307
5 NEAR EAST 195068377
6 EASTERN EUROPE 119914717
7 OCEANIA 33131662
8 SUB-SAHARAN AFRICA 749437000
9 NORTHERN AFRICA 161407133
10 LATIN AMER. & CARIB 561824599

We can make the column name look better, by using alias:

In [221]:
pd_countries = countries.select("Region", "Population").groupBy("Region").agg(sum("Population").alias('Total')).toPandas()
In [222]:
pd_countries
Out[222]:
Region Total
0 BALTICS 7184974
1 C.W. OF IND. STATES 280081548
2 ASIA (EX. NEAR EAST) 3687982236
3 WESTERN EUROPE 396339998
4 NORTHERN AMERICA 331672307
5 NEAR EAST 195068377
6 EASTERN EUROPE 119914717
7 OCEANIA 33131662
8 SUB-SAHARAN AFRICA 749437000
9 NORTHERN AFRICA 161407133
10 LATIN AMER. & CARIB 561824599

Plot examples

Pandas DataFrames are useful for plotting using MatPlotLib:

In [223]:
pd_countries.plot(x='Region', y='Total',kind='bar', figsize=(10, 6))
Out[223]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a499bf5b0>

User defined functions for data manipulation

Our countries DataFrame has some problems:

  • missing values
  • some numbers use comma instead of point as floating point separator (e.g. Literacy = 99,4)

We can clean the data using User Defined Functions (UDF)

In [224]:
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf

Define a Python function which coverts numbers with commas to float

In [225]:
def to_float (s) :
   return float(s.replace(',','.'))

Test that it works:

In [226]:
to_float('0,99')
Out[226]:
0.99

Now define a Spark UDF:

In [227]:
float_udf = udf(to_float , FloatType())

Test it on a Data Frame

In [228]:
countries.withColumn("Literacy", float_udf("Literacy (%)"))
Out[228]:
DataFrame[Country: string, Region: string, Population: int, Area (sq. mi.): int, Pop. Density (per sq. mi.): string, Coastline (coast/area ratio): string, Net migration: string, Infant mortality (per 1000 births): string, GDP ($ per capita): int, Literacy (%): string, Phones (per 1000): string, Arable (%): string, Crops (%): string, Other (%): string, Climate: string, Birthrate: string, Deathrate: string, Agriculture: string, Industry: string, Service: string, Literacy: float]

OK, we can see that the Literacy is now float

In [229]:
countries.show(50)
+--------------------+--------------------+----------+--------------+--------------------------+----------------------------+-------------+----------------------------------+------------------+------------+-----------------+----------+---------+---------+-------+---------+---------+-----------+--------+-------+
|             Country|              Region|Population|Area (sq. mi.)|Pop. Density (per sq. mi.)|Coastline (coast/area ratio)|Net migration|Infant mortality (per 1000 births)|GDP ($ per capita)|Literacy (%)|Phones (per 1000)|Arable (%)|Crops (%)|Other (%)|Climate|Birthrate|Deathrate|Agriculture|Industry|Service|
+--------------------+--------------------+----------+--------------+--------------------------+----------------------------+-------------+----------------------------------+------------------+------------+-----------------+----------+---------+---------+-------+---------+---------+-----------+--------+-------+
|        Afghanistan |ASIA (EX. NEAR EA...|  31056997|        647500|                      48,0|                        0,00|        23,06|                            163,07|               700|        36,0|              3,2|     12,13|     0,22|    87,65|      1|     46,6|    20,34|       0,38|    0,24|   0,38|
|            Albania |EASTERN EUROPE   ...|   3581655|         28748|                     124,6|                        1,26|        -4,93|                             21,52|              4500|        86,5|             71,2|     21,09|     4,42|    74,49|      3|    15,11|     5,22|      0,232|   0,188|  0,579|
|            Algeria |NORTHERN AFRICA  ...|  32930091|       2381740|                      13,8|                        0,04|        -0,39|                                31|              6000|        70,0|             78,1|      3,22|     0,25|    96,53|      1|    17,14|     4,61|      0,101|     0,6|  0,298|
|     American Samoa |OCEANIA          ...|     57794|           199|                     290,4|                       58,29|       -20,71|                              9,27|              8000|        97,0|            259,5|        10|       15|       75|      2|    22,46|     3,27|       null|    null|   null|
|            Andorra |WESTERN EUROPE   ...|     71201|           468|                     152,1|                        0,00|          6,6|                              4,05|             19000|       100,0|            497,2|      2,22|        0|    97,78|      3|     8,71|     6,25|       null|    null|   null|
|             Angola |SUB-SAHARAN AFRIC...|  12127071|       1246700|                       9,7|                        0,13|            0|                            191,19|              1900|        42,0|              7,8|      2,41|     0,24|    97,35|   null|    45,11|     24,2|      0,096|   0,658|  0,246|
|           Anguilla |LATIN AMER. & CAR...|     13477|           102|                     132,1|                       59,80|        10,76|                             21,03|              8600|        95,0|            460,0|         0|        0|      100|      2|    14,17|     5,34|       0,04|    0,18|   0,78|
|  Antigua & Barbuda |LATIN AMER. & CAR...|     69108|           443|                     156,0|                       34,54|        -6,15|                             19,46|             11000|        89,0|            549,9|     18,18|     4,55|    77,27|      2|    16,93|     5,37|      0,038|    0,22|  0,743|
|          Argentina |LATIN AMER. & CAR...|  39921833|       2766890|                      14,4|                        0,18|         0,61|                             15,18|             11200|        97,1|            220,4|     12,31|     0,48|    87,21|      3|    16,73|     7,55|      0,095|   0,358|  0,547|
|            Armenia |C.W. OF IND. STATES |   2976372|         29800|                      99,9|                        0,00|        -6,47|                             23,28|              3500|        98,6|            195,7|     17,55|      2,3|    80,15|      4|    12,07|     8,23|      0,239|   0,343|  0,418|
|              Aruba |LATIN AMER. & CAR...|     71891|           193|                     372,5|                       35,49|            0|                              5,89|             28000|        97,0|            516,1|     10,53|        0|    89,47|      2|    11,03|     6,68|      0,004|   0,333|  0,663|
|          Australia |OCEANIA          ...|  20264082|       7686850|                       2,6|                        0,34|         3,98|                              4,69|             29000|       100,0|            565,5|      6,55|     0,04|    93,41|      1|    12,14|     7,51|      0,038|   0,262|    0,7|
|            Austria |WESTERN EUROPE   ...|   8192880|         83870|                      97,7|                        0,00|            2|                              4,66|             30000|        98,0|            452,2|     16,91|     0,86|    82,23|      3|     8,74|     9,76|      0,018|   0,304|  0,678|
|         Azerbaijan |C.W. OF IND. STATES |   7961619|         86600|                      91,9|                        0,00|         -4,9|                             81,74|              3400|        97,0|            137,1|     19,63|     2,71|    77,66|      1|    20,74|     9,75|      0,141|   0,457|  0,402|
|       Bahamas, The |LATIN AMER. & CAR...|    303770|         13940|                      21,8|                       25,41|         -2,2|                             25,21|             16700|        95,6|            460,6|       0,8|      0,4|     98,8|      2|    17,57|     9,05|       0,03|    0,07|    0,9|
|            Bahrain |NEAR EAST        ...|    698585|           665|                    1050,5|                       24,21|         1,05|                             17,27|             16900|        89,1|            281,3|      2,82|     5,63|    91,55|      1|     17,8|     4,14|      0,005|   0,387|  0,608|
|         Bangladesh |ASIA (EX. NEAR EA...| 147365352|        144000|                    1023,4|                        0,40|        -0,71|                              62,6|              1900|        43,1|              7,3|     62,11|     3,07|    34,82|      2|     29,8|     8,27|      0,199|   0,198|  0,603|
|           Barbados |LATIN AMER. & CAR...|    279912|           431|                     649,5|                       22,51|        -0,31|                              12,5|             15700|        97,4|            481,9|     37,21|     2,33|    60,46|      2|    12,71|     8,67|       0,06|    0,16|   0,78|
|            Belarus |C.W. OF IND. STATES |  10293011|        207600|                      49,6|                        0,00|         2,54|                             13,37|              6100|        99,6|            319,1|     29,55|      0,6|    69,85|      4|    11,16|    14,02|      0,093|   0,316|  0,591|
|            Belgium |WESTERN EUROPE   ...|  10379067|         30528|                     340,0|                        0,22|         1,23|                              4,68|             29100|        98,0|            462,6|     23,28|      0,4|    76,32|      3|    10,38|    10,27|       0,01|    0,24|  0,749|
|             Belize |LATIN AMER. & CAR...|    287730|         22966|                      12,5|                        1,68|            0|                             25,69|              4900|        94,1|            115,7|      2,85|     1,71|    95,44|      2|    28,84|     5,72|      0,142|   0,152|  0,612|
|              Benin |SUB-SAHARAN AFRIC...|   7862944|        112620|                      69,8|                        0,11|            0|                                85|              1100|        40,9|              9,7|     18,08|      2,4|    79,52|      2|    38,85|    12,22|      0,316|   0,138|  0,546|
|            Bermuda |NORTHERN AMERICA ...|     65773|            53|                    1241,0|                      194,34|         2,49|                              8,53|             36000|        98,0|            851,4|        20|        0|       80|      2|     11,4|     7,74|       0,01|     0,1|   0,89|
|             Bhutan |ASIA (EX. NEAR EA...|   2279723|         47000|                      48,5|                        0,00|            0|                            100,44|              1300|        42,2|             14,3|      3,09|     0,43|    96,48|      2|    33,65|     12,7|      0,258|   0,379|  0,363|
|            Bolivia |LATIN AMER. & CAR...|   8989046|       1098580|                       8,2|                        0,00|        -1,32|                             53,11|              2400|        87,2|             71,9|      2,67|     0,19|    97,14|    1,5|     23,3|     7,53|      0,128|   0,352|   0,52|
|Bosnia & Herzegov...|EASTERN EUROPE   ...|   4498976|         51129|                      88,0|                        0,04|         0,31|                             21,05|              6100|        null|            215,4|      13,6|     2,96|    83,44|      4|     8,77|     8,27|      0,142|   0,308|   0,55|
|           Botswana |SUB-SAHARAN AFRIC...|   1639833|        600370|                       2,7|                        0,00|            0|                             54,58|              9000|        79,8|             80,5|      0,65|     0,01|    99,34|      1|    23,08|     29,5|      0,024|   0,469|  0,507|
|             Brazil |LATIN AMER. & CAR...| 188078227|       8511965|                      22,1|                        0,09|        -0,03|                             29,61|              7600|        86,4|            225,3|      6,96|      0,9|    92,15|      2|    16,56|     6,17|      0,084|     0,4|  0,516|
| British Virgin Is. |LATIN AMER. & CAR...|     23098|           153|                     151,0|                       52,29|        10,01|                             18,05|             16000|        97,8|            506,5|        20|     6,67|    73,33|      2|    14,89|     4,42|      0,018|   0,062|   0,92|
|             Brunei |ASIA (EX. NEAR EA...|    379444|          5770|                      65,8|                        2,79|         3,59|                             12,61|             18600|        93,9|            237,2|      0,57|     0,76|    98,67|      2|    18,79|     3,45|      0,036|   0,561|  0,403|
|           Bulgaria |EASTERN EUROPE   ...|   7385367|        110910|                      66,6|                        0,32|        -4,58|                             20,55|              7600|        98,6|            336,3|     40,02|     1,92|    58,06|      3|     9,65|    14,27|      0,093|   0,304|  0,603|
|       Burkina Faso |SUB-SAHARAN AFRIC...|  13902972|        274200|                      50,7|                        0,00|            0|                             97,57|              1100|        26,6|              7,0|     14,43|     0,19|    85,38|      2|    45,62|     15,6|      0,322|   0,196|  0,482|
|              Burma |ASIA (EX. NEAR EA...|  47382633|        678500|                      69,8|                        0,28|         -1,8|                             67,24|              1800|        85,3|             10,1|     15,19|     0,97|    83,84|      2|    17,91|     9,83|      0,564|   0,082|  0,353|
|            Burundi |SUB-SAHARAN AFRIC...|   8090068|         27830|                     290,7|                        0,00|        -0,06|                             69,29|               600|        51,6|              3,4|     35,05|    14,02|    50,93|      2|    42,22|    13,46|      0,463|   0,203|  0,334|
|           Cambodia |ASIA (EX. NEAR EA...|  13881427|        181040|                      76,7|                        0,24|            0|                             71,48|              1900|        69,4|              2,6|     20,96|     0,61|    78,43|      2|     26,9|     9,06|       0,35|     0,3|   0,35|
|           Cameroon |SUB-SAHARAN AFRIC...|  17340702|        475440|                      36,5|                        0,08|            0|                             68,26|              1800|        79,0|              5,7|     12,81|     2,58|    84,61|    1,5|    33,89|    13,47|      0,448|    0,17|  0,382|
|             Canada |NORTHERN AMERICA ...|  33098932|       9984670|                       3,3|                        2,02|         5,96|                              4,75|             29800|        97,0|            552,2|      4,96|     0,02|    95,02|   null|    10,78|      7,8|      0,022|   0,294|  0,684|
|         Cape Verde |SUB-SAHARAN AFRIC...|    420979|          4033|                     104,4|                       23,93|       -12,07|                             47,77|              1400|        76,6|            169,6|      9,68|      0,5|    89,82|      3|    24,87|     6,55|      0,121|   0,219|   0,66|
|     Cayman Islands |LATIN AMER. & CAR...|     45436|           262|                     173,4|                       61,07|        18,75|                              8,19|             35000|        98,0|            836,3|      3,85|        0|    96,15|      2|    12,74|     4,89|      0,014|   0,032|  0,954|
|Central African R...|SUB-SAHARAN AFRIC...|   4303356|        622984|                       6,9|                        0,00|            0|                                91|              1100|        51,0|              2,3|       3,1|     0,14|    96,76|      2|    33,91|    18,65|       0,55|     0,2|   0,25|
|               Chad |SUB-SAHARAN AFRIC...|   9944201|       1284000|                       7,7|                        0,00|        -0,11|                             93,82|              1200|        47,5|              1,3|      2,86|     0,02|    97,12|      2|    45,73|    16,38|      0,335|   0,259|  0,406|
|              Chile |LATIN AMER. & CAR...|  16134219|        756950|                      21,3|                        0,85|            0|                               8,8|              9900|        96,2|            213,0|      2,65|     0,42|    96,93|      3|    15,23|     5,81|       0,06|   0,493|  0,447|
|              China |ASIA (EX. NEAR EA...|1313973713|       9596960|                     136,9|                        0,15|         -0,4|                             24,18|              5000|        90,9|            266,7|      15,4|     1,25|    83,35|    1,5|    13,25|     6,97|      0,125|   0,473|  0,403|
|           Colombia |LATIN AMER. & CAR...|  43593035|       1138910|                      38,3|                        0,28|        -0,31|                             20,97|              6300|        92,5|            176,2|      2,42|     1,67|    95,91|      2|    20,48|     5,58|      0,125|   0,342|  0,533|
|            Comoros |SUB-SAHARAN AFRIC...|    690948|          2170|                     318,4|                       15,67|            0|                             74,93|               700|        56,5|             24,5|     35,87|    23,32|    40,81|      2|    36,93|      8,2|        0,4|    0,04|   0,56|
|   Congo, Dem. Rep. |SUB-SAHARAN AFRIC...|  62660551|       2345410|                      26,7|                        0,00|            0|                             94,69|               700|        65,5|              0,2|      2,96|     0,52|    96,52|      2|    43,69|    13,27|       0,55|    0,11|   0,34|
|Congo, Repub. of ...|SUB-SAHARAN AFRIC...|   3702314|        342000|                      10,8|                        0,05|        -0,17|                             93,86|               700|        83,8|              3,7|      0,51|     0,13|    99,36|      2|    42,57|    12,93|      0,062|    0,57|  0,369|
|       Cook Islands |OCEANIA          ...|     21388|           240|                      89,1|                       50,00|         null|                              null|              5000|        95,0|            289,9|     17,39|    13,04|    69,57|      2|       21|     null|      0,151|   0,096|  0,753|
|         Costa Rica |LATIN AMER. & CAR...|   4075261|         51100|                      79,8|                        2,52|         0,51|                              9,95|              9100|        96,0|            340,7|      4,41|     5,88|    89,71|      2|    18,32|     4,36|      0,088|   0,299|  0,614|
|      Cote d'Ivoire |SUB-SAHARAN AFRIC...|  17654843|        322460|                      54,8|                        0,16|        -0,07|                             90,83|              1400|        50,9|             14,6|      9,75|    13,84|    76,41|      2|    35,11|    14,84|      0,279|   0,171|   0,55|
+--------------------+--------------------+----------+--------------+--------------------------+----------------------------+-------------+----------------------------------+------------------+------------+-----------------+----------+---------+---------+-------+---------+---------+-----------+--------+-------+
only showing top 50 rows

In [230]:
# countries.where((col("Literacy") < 50) & (col("GDP ($ per capita)") > 700)).show()

Oops, what does it mean???

  • some rows have empty values!

Before we can use the table, we need to remove empty rows. Otherwise our UDF will fail.

In [231]:
full_countries = countries.select('Country', 'Population', 'Literacy (%)', 'GDP ($ per capita)').na.drop()

We can now apply the new UDF to the Data Frame:

In [232]:
full_countries = full_countries.withColumn("Literacy", float_udf("Literacy (%)"))
In [233]:
full_countries.show(50)
+--------------------+----------+------------+------------------+--------+
|             Country|Population|Literacy (%)|GDP ($ per capita)|Literacy|
+--------------------+----------+------------+------------------+--------+
|        Afghanistan |  31056997|        36,0|               700|    36.0|
|            Albania |   3581655|        86,5|              4500|    86.5|
|            Algeria |  32930091|        70,0|              6000|    70.0|
|     American Samoa |     57794|        97,0|              8000|    97.0|
|            Andorra |     71201|       100,0|             19000|   100.0|
|             Angola |  12127071|        42,0|              1900|    42.0|
|           Anguilla |     13477|        95,0|              8600|    95.0|
|  Antigua & Barbuda |     69108|        89,0|             11000|    89.0|
|          Argentina |  39921833|        97,1|             11200|    97.1|
|            Armenia |   2976372|        98,6|              3500|    98.6|
|              Aruba |     71891|        97,0|             28000|    97.0|
|          Australia |  20264082|       100,0|             29000|   100.0|
|            Austria |   8192880|        98,0|             30000|    98.0|
|         Azerbaijan |   7961619|        97,0|              3400|    97.0|
|       Bahamas, The |    303770|        95,6|             16700|    95.6|
|            Bahrain |    698585|        89,1|             16900|    89.1|
|         Bangladesh | 147365352|        43,1|              1900|    43.1|
|           Barbados |    279912|        97,4|             15700|    97.4|
|            Belarus |  10293011|        99,6|              6100|    99.6|
|            Belgium |  10379067|        98,0|             29100|    98.0|
|             Belize |    287730|        94,1|              4900|    94.1|
|              Benin |   7862944|        40,9|              1100|    40.9|
|            Bermuda |     65773|        98,0|             36000|    98.0|
|             Bhutan |   2279723|        42,2|              1300|    42.2|
|            Bolivia |   8989046|        87,2|              2400|    87.2|
|           Botswana |   1639833|        79,8|              9000|    79.8|
|             Brazil | 188078227|        86,4|              7600|    86.4|
| British Virgin Is. |     23098|        97,8|             16000|    97.8|
|             Brunei |    379444|        93,9|             18600|    93.9|
|           Bulgaria |   7385367|        98,6|              7600|    98.6|
|       Burkina Faso |  13902972|        26,6|              1100|    26.6|
|              Burma |  47382633|        85,3|              1800|    85.3|
|            Burundi |   8090068|        51,6|               600|    51.6|
|           Cambodia |  13881427|        69,4|              1900|    69.4|
|           Cameroon |  17340702|        79,0|              1800|    79.0|
|             Canada |  33098932|        97,0|             29800|    97.0|
|         Cape Verde |    420979|        76,6|              1400|    76.6|
|     Cayman Islands |     45436|        98,0|             35000|    98.0|
|Central African R...|   4303356|        51,0|              1100|    51.0|
|               Chad |   9944201|        47,5|              1200|    47.5|
|              Chile |  16134219|        96,2|              9900|    96.2|
|              China |1313973713|        90,9|              5000|    90.9|
|           Colombia |  43593035|        92,5|              6300|    92.5|
|            Comoros |    690948|        56,5|               700|    56.5|
|   Congo, Dem. Rep. |  62660551|        65,5|               700|    65.5|
|Congo, Repub. of ...|   3702314|        83,8|               700|    83.8|
|       Cook Islands |     21388|        95,0|              5000|    95.0|
|         Costa Rica |   4075261|        96,0|              9100|    96.0|
|      Cote d'Ivoire |  17654843|        50,9|              1400|    50.9|
|            Croatia |   4494749|        98,5|             10600|    98.5|
+--------------------+----------+------------+------------------+--------+
only showing top 50 rows

In [234]:
full_countries.where((col("Literacy") < 50) & (col("GDP ($ per capita)") > 700)).show()
+--------------+----------+------------+------------------+--------+
|       Country|Population|Literacy (%)|GDP ($ per capita)|Literacy|
+--------------+----------+------------+------------------+--------+
|       Angola |  12127071|        42,0|              1900|    42.0|
|   Bangladesh | 147365352|        43,1|              1900|    43.1|
|        Benin |   7862944|        40,9|              1100|    40.9|
|       Bhutan |   2279723|        42,2|              1300|    42.2|
| Burkina Faso |  13902972|        26,6|              1100|    26.6|
|         Chad |   9944201|        47,5|              1200|    47.5|
|  Gambia, The |   1641564|        40,1|              1700|    40.1|
|       Guinea |   9690222|        35,9|              2100|    35.9|
|Guinea-Bissau |   1442029|        42,4|               800|    42.4|
|         Iraq |  26783383|        40,4|              1500|    40.4|
|         Mali |  11716829|        46,4|               900|    46.4|
|   Mauritania |   3177388|        41,7|              1800|    41.7|
|   Mozambique |  19686505|        47,8|              1200|    47.8|
|        Nepal |  28287147|        45,2|              1400|    45.2|
|        Niger |  12525094|        17,6|               800|    17.6|
|     Pakistan | 165803560|        45,7|              2100|    45.7|
|      Senegal |  11987121|        40,2|              1600|    40.2|
+--------------+----------+------------+------------------+--------+

In [235]:
full_countries.toPandas().plot(x="Literacy",y="GDP ($ per capita)",kind="scatter",figsize=(10, 6))
Out[235]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a49bf2e80>

Task 1

Narysuj wykres zależności kolumn Birthrate i Deathrate w zależności od GDP per capita dla 30 najbogatszych (zależnośc Birthrate od GDP oraz Deathrate od GDP na jednym wykresie róznymi kolorami) i 30 najbiedniejszych krajów (podobnie)

In [236]:
countries.toPandas()
Out[236]:
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
0 Afghanistan ASIA (EX. NEAR EAST) 31056997 647500 48,0 0,00 23,06 163,07 700.0 36,0 3,2 12,13 0,22 87,65 1 46,6 20,34 0,38 0,24 0,38
1 Albania EASTERN EUROPE 3581655 28748 124,6 1,26 -4,93 21,52 4500.0 86,5 71,2 21,09 4,42 74,49 3 15,11 5,22 0,232 0,188 0,579
2 Algeria NORTHERN AFRICA 32930091 2381740 13,8 0,04 -0,39 31 6000.0 70,0 78,1 3,22 0,25 96,53 1 17,14 4,61 0,101 0,6 0,298
3 American Samoa OCEANIA 57794 199 290,4 58,29 -20,71 9,27 8000.0 97,0 259,5 10 15 75 2 22,46 3,27 None None None
4 Andorra WESTERN EUROPE 71201 468 152,1 0,00 6,6 4,05 19000.0 100,0 497,2 2,22 0 97,78 3 8,71 6,25 None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
222 West Bank NEAR EAST 2460492 5860 419,9 0,00 2,98 19,62 800.0 None 145,2 16,9 18,97 64,13 3 31,67 3,92 0,09 0,28 0,63
223 Western Sahara NORTHERN AFRICA 273008 266000 1,0 0,42 None None NaN None None 0,02 0 99,98 1 None None None None 0,4
224 Yemen NEAR EAST 21456188 527970 40,6 0,36 0 61,5 800.0 50,2 37,2 2,78 0,24 96,98 1 42,89 8,3 0,135 0,472 0,393
225 Zambia SUB-SAHARAN AFRICA 11502010 752614 15,3 0,00 0 88,29 800.0 80,6 8,2 7,08 0,03 92,9 2 41 19,93 0,22 0,29 0,489
226 Zimbabwe SUB-SAHARAN AFRICA 12236805 390580 31,3 0,00 0 67,69 1900.0 90,7 26,8 8,32 0,34 91,34 2 28,01 21,84 0,179 0,243 0,579

227 rows × 20 columns

In [237]:
countries_plot = countries.select('Country', 'GDP ($ per capita)', 'Birthrate', 'Deathrate').na.drop()

Top 30 richest countries

In [238]:
top_30 = countries_plot.withColumn("Birthrate", float_udf("Birthrate"))\
                                .withColumn("Deathrate", float_udf("Deathrate"))\
                                .withColumn("GDP ($ per capita)", col("GDP ($ per capita)").cast("float"))\
                                .orderBy("GDP ($ per capita)", ascending=False)\
                                .limit(30)
In [239]:
top_30_df = top_30.toPandas()
In [240]:
top_30_df.head(5)
Out[240]:
Country GDP ($ per capita) Birthrate Deathrate
0 Luxembourg 55100.0 11.94 8.41
1 Norway 37800.0 11.46 9.40
2 United States 37800.0 14.14 8.26
3 Bermuda 36000.0 11.40 7.74
4 Cayman Islands 35000.0 12.74 4.89
In [317]:
plt.figure(figsize=(10, 6))
plt.scatter(top_30_df['Birthrate'], top_30_df['GDP ($ per capita)'], label='Birthrate')
plt.scatter(top_30_df['Deathrate'], top_30_df['GDP ($ per capita)'], label='Deathrate')
plt.ylabel('GDP ($ per capita)')
plt.legend()
plt.show()

The richest countries often are characterised by similar level of Birthrate and Deathrate.

Bottom 30 - 30 poorest countries

In [242]:
bot_30 = countries_plot.withColumn("Birthrate", float_udf("Birthrate"))\
                                .withColumn("Deathrate", float_udf("Deathrate"))\
                                .withColumn("GDP ($ per capita)", col("GDP ($ per capita)").cast("float"))\
                                .orderBy("GDP ($ per capita)", ascending=True)\
                                .limit(30)
In [316]:
bot_30_df = bot_30.toPandas()
plt.figure(figsize=(10, 6))
plt.scatter(bot_30_df['Birthrate'], bot_30_df['GDP ($ per capita)'], label='Birthrate')
plt.scatter(bot_30_df['Deathrate'], bot_30_df['GDP ($ per capita)'], label='Deathrate')
plt.ylabel('GDP ($ per capita)')
plt.legend()
plt.show()

The poorest countries are characterised by greater Birthrate than Deathrate.

Task 2

Wczytaj zbiór danych "airports.csv" ze strony https://www.kaggle.com/jonatancr/airports (zawiera on dane na temat lotnisk z całego świata). Ponieważ w zbiorze tym nie ma nagłówka, nadaj własne nazwy kolumnom, korzystając z opisu pliku na tej stronie. Narysuj na wykresie (mapie) położenia wszystkich/wybranych lotnisk.

Read and rename

In [244]:
airports = spark.read.csv("airports.csv",header=False)
airports.toPandas().head(2)
Out[244]:
_c0 _c1 _c2 _c3 _c4 _c5 _c6 _c7 _c8 _c9 _c10 _c11 _c12 _c13
0 1 Goroka Airport Goroka Papua New Guinea GKA AYGA -6.081689834590001 145.391998291 5282 10 U Pacific/Port_Moresby airport OurAirports
1 2 Madang Airport Madang Papua New Guinea MAG AYMD -5.20707988739 145.789001465 20 10 U Pacific/Port_Moresby airport OurAirports
In [245]:
airports = airports.toDF("AirportID", "Name", "City", "Country", "IATA", "ICAO", "Latitude", 
                         "Longitude", "Altitude", "Timezone", "DST", " Timezone", "Type", "Source")
airports.toPandas().head(2)
Out[245]:
AirportID Name City Country IATA ICAO Latitude Longitude Altitude Timezone DST Timezone Type Source
0 1 Goroka Airport Goroka Papua New Guinea GKA AYGA -6.081689834590001 145.391998291 5282 10 U Pacific/Port_Moresby airport OurAirports
1 2 Madang Airport Madang Papua New Guinea MAG AYMD -5.20707988739 145.789001465 20 10 U Pacific/Port_Moresby airport OurAirports

Select data and plot

In [246]:
air_points_pd = airports.select("Name", "Latitude", "Longitude").toPandas()
air_points_pd.head(3)
Out[246]:
Name Latitude Longitude
0 Goroka Airport -6.081689834590001 145.391998291
1 Madang Airport -5.20707988739 145.789001465
2 Mount Hagen Kagamuga Airport -5.826789855957031 144.29600524902344
In [247]:
import folium
m = folium.Map()
for p in range(0, len(air_points_pd)):
    folium.Marker((air_points_pd.iloc[p, 1], air_points_pd.iloc[p, 2]), popup=air_points_pd['Name'][p]).add_to(m)
m
Out[247]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Task 3

Znajdź 10 krajów w których najniżej położone lotnisko ma najwyższą wysokość, podając wyniki w m. n. p. m.

In [248]:
airports.toPandas().head(2)
Out[248]:
AirportID Name City Country IATA ICAO Latitude Longitude Altitude Timezone DST Timezone Type Source
0 1 Goroka Airport Goroka Papua New Guinea GKA AYGA -6.081689834590001 145.391998291 5282 10 U Pacific/Port_Moresby airport OurAirports
1 2 Madang Airport Madang Papua New Guinea MAG AYMD -5.20707988739 145.789001465 20 10 U Pacific/Port_Moresby airport OurAirports
In [249]:
airports = airports.withColumn("Altitude", col("Altitude").cast(FloatType()))
In [250]:
from pyspark.sql import functions as F

low = airports.groupBy("Country")\
                            .agg(F.first("Name").alias("Name"), F.min("Altitude").alias("Altitude"))
low.toPandas()
Out[250]:
Country Name Altitude
0 Chad Abeche Airport 771.0
1 Anguilla Wallblake Airport 127.0
2 Paraguay Silvio Pettirossi International Airport 223.0
3 Russia Yakutsk Airport -65.0
4 British Indian Ocean Territory Diego Garcia Naval Support Facility 9.0
... ... ... ...
232 Moldova Balti International Airport 141.0
233 United Kingdom Belfast International Airport 0.0
234 Vietnam Da Nang International Airport 6.0
235 Mali Senou Airport 164.0
236 Netherlands Amsterdam Airport Schiphol -15.0

237 rows × 3 columns

In [251]:
top = low.orderBy("Altitude", ascending=False)

Cast feets to meters

In [252]:
top = top.withColumn("Altitude [m]", col("Altitude") * 0.3048)\
        .limit(10)
top.toPandas()
Out[252]:
Country Name Altitude Altitude [m]
0 Lesotho Moshoeshoe I International Airport 5105.0 1556.0040
1 Rwanda Gisenyi Airport 4859.0 1481.0232
2 Burundi Bujumbura International Airport 2582.0 786.9936
3 West Bank Jerusalem Airport 2485.0 757.4280
4 Uganda Entebbe International Airport 2472.0 753.4656
5 Swaziland Matsapha Airport 2075.0 632.4600
6 Kyrgyzstan Manas International Airport 2058.0 627.2784
7 Botswana Francistown Airport 1772.0 540.1056
8 Malawi Chileka International Airport 1580.0 481.5840
9 South Sudan Juba International Airport 1513.0 461.1624

Task 4

Na wykresie scatterplot narysuj zależność liczby lotnisk od powierzchni kraju.

  • Dane potrzebne do wykresu są w dwóch ramkach (Countries i Airports), konieczne więc będzie ich złączenie (join).
  • Nie zawsze nazwy tych samych krajów będą identyczne w obu ramkach, co może stanowić problem (złączenie nie powstanie). Zastanów się jak przy pomocy mechanizmu złączeń wykryć, które nazwy się różnią. Stwórz odpowiednią ramkę, która zawiera takie nazwy.
In [253]:
sorted(airports.toPandas()['Country'].unique())
airports = airports.orderBy('Country')
airports.toPandas().head(2)
Out[253]:
AirportID Name City Country IATA ICAO Latitude Longitude Altitude Timezone DST Timezone Type Source
0 2048 Herat Airport Herat Afghanistan HEA OAHR 34.209999084472656 62.22829818725586 3206.0 4.5 U Asia/Kabul airport OurAirports
1 2049 Jalalabad Airport Jalalabad Afghanistan JAA OAJL 34.39979934692383 70.49859619140625 1814.0 4.5 U Asia/Kabul airport OurAirports
In [254]:
sorted(countries.toPandas()['Country'].unique())
countries = countries.orderBy('Country')
countries.toPandas().head(2)
Out[254]:
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
0 Afghanistan ASIA (EX. NEAR EAST) 31056997 647500 48,0 0,00 23,06 163,07 700.0 36,0 3,2 12,13 0,22 87,65 1 46,6 20,34 0,38 0,24 0,38
1 Albania EASTERN EUROPE 3581655 28748 124,6 1,26 -4,93 21,52 4500.0 86,5 71,2 21,09 4,42 74,49 3 15,11 5,22 0,232 0,188 0,579

Remove white spaces

In [292]:
from pyspark.sql.functions import upper, count, isnull, trim

airports = airports.withColumn('Country', trim(airports['Country']))
countries = countries.withColumn('Country', trim(countries['Country']))
In [293]:
merged = airports.alias('a').join(countries.alias('c'), on=airports['Country'] == countries['Country'], how='left')

result = merged.groupBy('a.Country', 'c.`Area (sq. mi.)`').agg(count('a.AirportID').alias('Amount'))

result_sorted = result.orderBy(col('Amount').desc())

result_sorted.show()
+--------------+--------------+------+
|       Country|Area (sq. mi.)|Amount|
+--------------+--------------+------+
| United States|       9631420|  1435|
|        Canada|       9984670|   417|
|     Australia|       7686850|   296|
|       Germany|        357021|   241|
|        Russia|      17075200|   238|
|        Brazil|       8511965|   234|
|        France|        547030|   214|
|         China|       9596960|   180|
|United Kingdom|        244820|   162|
|         India|       3287590|   125|
|     Indonesia|       1919440|   125|
|         Japan|        377835|   119|
|  South Africa|       1219912|    96|
|     Argentina|       2766890|    94|
|        Mexico|       1972550|    83|
|         Italy|        301230|    82|
|          Iran|       1648000|    81|
|        Sweden|        449964|    77|
|      Colombia|       1138910|    74|
|        Turkey|        780580|    72|
+--------------+--------------+------+
only showing top 20 rows

Not matched

Some of them are just part (islands) of another country like Anguilla (UK). Many of them had whitespaces, so I've removed them.

In [296]:
merged = airports.alias('a').join(
    countries.alias('c'), 
    on=col('a.Country') == col('c.Country'), 
    how='full_outer'
)

diff_countries = merged.filter(isnull(col('a.Country')) | isnull(col('c.Country')))

diff_countries.select('a.Country', 'c.Country').distinct().show(100)
+--------------------+--------------------+
|             Country|             Country|
+--------------------+--------------------+
|                null|Congo, Repub. of the|
|British Indian Oc...|                null|
|                null|        Korea, North|
|                null|    Congo, Dem. Rep.|
|      Norfolk Island|                null|
|Northern Mariana ...|                null|
|Turks and Caicos ...|                null|
|                null|Central African Rep.|
|British Virgin Is...|                null|
|                null|          San Marino|
|             Myanmar|                null|
|             Bahamas|                null|
|                null|Micronesia, Fed. St.|
|Sao Tome and Prin...|                null|
|                null|             Andorra|
|                null|  N. Mariana Islands|
|Saint Pierre and ...|                null|
|    Congo (Kinshasa)|                null|
|Central African R...|                null|
|                null|       Liechtenstein|
|            Svalbard|                null|
|    Christmas Island|                null|
|Cocos (Keeling) I...|                null|
|                null|         Gambia, The|
|                null|          Gaza Strip|
|                null|              Monaco|
|          Antarctica|                null|
|Saint Kitts and N...|                null|
|         North Korea|                null|
|         South Korea|                null|
|           Palestine|                null|
|          Montenegro|                null|
|                null|St Pierre & Miquelon|
| Trinidad and Tobago|                null|
| Congo (Brazzaville)|                null|
|      Midway Islands|                null|
|      Johnston Atoll|                null|
|         South Sudan|                null|
|                null|  British Virgin Is.|
|                null|        Korea, South|
|              Gambia|                null|
| Antigua and Barbuda|                null|
|    Falkland Islands|                null|
|                null|        Bahamas, The|
|Bosnia and Herzeg...|                null|
|                null|   Trinidad & Tobago|
|         Wake Island|                null|
|                null|Bosnia & Herzegovina|
|          Micronesia|                null|
|                Niue|                null|
|                null|   Antigua & Barbuda|
|                null| Sao Tome & Principe|
|                null| Saint Kitts & Nevis|
|                null|   Turks & Caicos Is|
+--------------------+--------------------+

In [297]:
result_sorted = result_sorted.toPandas()
In [315]:
plt.figure(figsize=(10, 6))
plt.scatter(result_sorted['Area (sq. mi.)'], result_sorted['Amount'], c='blue')
plt.xlabel('Country Area (sq. mi.)')
plt.ylabel('Amount of airports')
plt.title('Relationship between country area and airports number')
plt.show()

Task 5

Oblicz, ile lotnisk jest na poszczególnych kontynentach. To zadanie również wymaga złączenia ramek "Countries" i "Airports". Wynik przedstaw na wykresie kołowym.

In [320]:
merged = airports.alias('a').join(countries.alias('c'), on=airports['Country'] == countries['Country'], how='left')

result = merged.groupBy('c.Region').agg(count('a.AirportID').alias('Amount'))

result_sorted = result.orderBy(col('Amount').desc())

result_sorted.show()
+--------------------+------+
|              Region|Amount|
+--------------------+------+
|NORTHERN AMERICA ...|  1875|
|WESTERN EUROPE   ...|  1224|
|ASIA (EX. NEAR EA...|  1064|
|LATIN AMER. & CAR...|   921|
|OCEANIA          ...|   542|
|SUB-SAHARAN AFRIC...|   534|
|C.W. OF IND. STATES |   357|
|NEAR EAST        ...|   208|
|                null|   171|
|EASTERN EUROPE   ...|   143|
|NORTHERN AFRICA  ...|   121|
|BALTICS          ...|    24|
+--------------------+------+

In [321]:
result_sorted = result_sorted.toPandas()
In [342]:
plt.figure(figsize=(10, 10))
plt.pie(result_sorted['Amount'], labels = result_sorted['Region'], autopct='%1.1f%%', textprops={'fontsize': 14})
plt.title('Airports distribution by region', fontweight='bold', size=20)
plt.show()
In [ ]: